fix: clean up Temporal server-side versioning data on TWD deletion by anujagrawal380 · Pull Request #240 · temporalio/temporal-worker-controller

anujagrawal380 · 2026-03-24T18:18:04Z

Add a finalizer to TemporalWorkerDeployment to run Temporal server-side cleanup before K8s deletion
Add a finalizer to TemporalConnection to prevent it from being deleted while any TWD still references it
On TWD deletion, set current version to unversioned, clear ramping version, and delete registered versions

Problem

When a TemporalWorkerDeployment resource is deleted (e.g., when switching back to plain Deployments), the Temporal server retains the build ID routing configuration. The matching service continues routing new tasks to the deleted build ID's physical queue, while unversioned workers poll a different physical queue. Tasks sit in the Scheduled state indefinitely with no errors.

A secondary race condition exists: Helm deletes both the TemporalConnection and TWD in the same upgrade. Without the connection, the controller cannot talk to Temporal to clean up. This is solved by adding a finalizer to the TemporalConnection that blocks its deletion until all referencing TWDs are gone.

Changes

internal/controller/worker_controller.go:

TWD finalizer (temporal.io/worker-deployment-cleanup):

Added to all TWD resources during normal reconciliation
On deletion, triggers handleDeletion() which:
1. Sets the current version to unversioned (BuildID: "") -- the critical step that unblocks task dispatch
2. Clears any ramping version
3. Deletes all registered versions with SkipDrainage: true
4. Attempts to delete the deployment record itself
5. Removes the connection finalizer if no other TWDs reference it
6. Removes its own finalizer, allowing K8s to complete deletion
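The ordering of these steps can be sketched in plain Go. This is a minimal illustration, not the controller's code: `versioningClient`, `cleanupOnDeletion`, and `recordingClient` are hypothetical names, and the interface is a stand-in rather than the Temporal SDK. The sketch returns errors from version and deployment deletion so the caller can requeue, as a later commit message in this PR describes, and it omits step 5. The finalizer name is the one given above; a later commit title mentions a single temporal.io/delete-protection finalizer instead.

```go
package main

import (
	"errors"
	"fmt"
)

// cleanupFinalizer is the TWD finalizer name from the PR description.
const cleanupFinalizer = "temporal.io/worker-deployment-cleanup"

// versioningClient is a hypothetical stand-in for the Temporal calls made
// during cleanup. It is NOT the Temporal SDK's interface.
type versioningClient interface {
	SetCurrentVersion(buildID string) error // "" means unversioned
	ClearRampingVersion() error
	ListVersions() ([]string, error)
	DeleteVersion(buildID string, skipDrainage bool) error
	DeleteDeployment() error
}

// cleanupOnDeletion runs steps 1-4 and 6 in order. On any failure it returns
// the finalizer list untouched so the caller can requeue and retry.
func cleanupOnDeletion(c versioningClient, finalizers []string) ([]string, error) {
	// Step 1: route to unversioned workers first; this unblocks task dispatch.
	if err := c.SetCurrentVersion(""); err != nil {
		return finalizers, fmt.Errorf("set current version to unversioned: %w", err)
	}
	// Step 2: clear any ramping version.
	if err := c.ClearRampingVersion(); err != nil {
		return finalizers, fmt.Errorf("clear ramping version: %w", err)
	}
	// Step 3: delete every registered version, skipping drainage.
	versions, err := c.ListVersions()
	if err != nil {
		return finalizers, fmt.Errorf("list versions: %w", err)
	}
	var errs []error
	for _, v := range versions {
		if err := c.DeleteVersion(v, true); err != nil {
			errs = append(errs, fmt.Errorf("delete version %q: %w", v, err))
		}
	}
	if err := errors.Join(errs...); err != nil {
		return finalizers, err
	}
	// Step 4: delete the deployment record itself.
	if err := c.DeleteDeployment(); err != nil {
		return finalizers, fmt.Errorf("delete deployment: %w", err)
	}
	// Step 6: drop our own finalizer (step 5 is not modelled here).
	kept := make([]string, 0, len(finalizers))
	for _, f := range finalizers {
		if f != cleanupFinalizer {
			kept = append(kept, f)
		}
	}
	return kept, nil
}

// recordingClient is an in-memory fake used only to exercise the sketch.
type recordingClient struct {
	Versions []string
	Busy     map[string]bool // versions whose deletion should fail
	Calls    []string
}

func (r *recordingClient) SetCurrentVersion(buildID string) error {
	r.Calls = append(r.Calls, "set-current:"+buildID)
	return nil
}

func (r *recordingClient) ClearRampingVersion() error {
	r.Calls = append(r.Calls, "clear-ramping")
	return nil
}

func (r *recordingClient) ListVersions() ([]string, error) { return r.Versions, nil }

func (r *recordingClient) DeleteVersion(buildID string, skipDrainage bool) error {
	if r.Busy[buildID] {
		return errors.New("version has active pollers")
	}
	r.Calls = append(r.Calls, "delete-version:"+buildID)
	return nil
}

func (r *recordingClient) DeleteDeployment() error {
	r.Calls = append(r.Calls, "delete-deployment")
	return nil
}
```

Calling `cleanupOnDeletion(&recordingClient{Versions: []string{"v1"}}, finalizers)` returns the finalizer list without the cleanup finalizer. If any version is marked busy, the list comes back unchanged together with an error, which is the signal to requeue.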

TemporalConnection finalizer (temporal.io/connection-in-use):

Added to the TemporalConnection during normal TWD reconciliation via ensureConnectionFinalizer()
Prevents the connection from being deleted while any TWD still references it
Removed by removeConnectionFinalizerIfUnused() during TWD deletion, after checking no other TWDs in the same namespace reference the connection
Guarantees the connection is always available during TWD cleanup -- no race condition with Helm deleting both resources simultaneously
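The "no other TWDs reference it" check behind removeConnectionFinalizerIfUnused() can be illustrated with a small, dependency-free sketch. The `workerDeployment` type and the `connectionInUse` function are hypothetical and carry only the fields the check needs; the real controller would presumably list TWDs through the Kubernetes API.

```go
package main

// workerDeployment is a hypothetical, trimmed-down view of a TWD.
type workerDeployment struct {
	Namespace      string
	Name           string
	ConnectionName string
}

// connectionInUse reports whether any TWD other than the one being deleted
// still references the same connection in the same namespace.
func connectionInUse(all []workerDeployment, deleting workerDeployment) bool {
	for _, d := range all {
		if d.Namespace != deleting.Namespace || d.ConnectionName != deleting.ConnectionName {
			continue // different namespace or different connection
		}
		if d.Name == deleting.Name {
			continue // the TWD being deleted does not count as a reference
		}
		return true
	}
	return false
}
```

The connection finalizer would be removed only when `connectionInUse` returns false, so the last TWD to go is the one that releases the connection.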

RBAC updates:

Added update;patch verbs for temporalconnections (was get;list;watch)
Added update verb for temporalconnections/finalizers
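As kubebuilder markers, these RBAC changes would look roughly like the fragment below. The API group `temporal.io` and the package name `controller` are assumptions made for illustration; the verbs and resources are the ones listed above.

```go
// Package controller is an assumed package name for this illustration.
package controller

// The API group "temporal.io" below is an assumption.
// +kubebuilder:rbac:groups=temporal.io,resources=temporalconnections,verbs=get;list;watch;update;patch
// +kubebuilder:rbac:groups=temporal.io,resources=temporalconnections/finalizers,verbs=update
```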

Deletion flow

Helm upgrade (TWD disabled)
  |
  v
Helm deletes TWD + TemporalConnection resources simultaneously
  |
  +--> TemporalConnection: has finalizer, K8s sets DeletionTimestamp but blocks deletion
  |
  +--> TWD: has finalizer, K8s sets DeletionTimestamp, triggers Reconcile
         |
         v
       handleDeletion() runs:
         1. Fetches TemporalConnection (guaranteed to exist via finalizer)
         2. Connects to Temporal server
         3. Sets current version to unversioned
         4. Deletes versions
         5. Removes connection finalizer (no other TWDs reference it)
         6. Removes TWD finalizer
         |
         v
       TWD deleted by K8s
         |
         v
       TemporalConnection: no more finalizers, deleted by K8s

Issue #55
Closes #166

CLAassistant · 2026-03-24T18:18:12Z

All committers have signed the CLA.

anujagrawal380 · 2026-03-24T18:20:22Z

PTAL @carlydf

jaypipes

@anujagrawal380 awesome contribution, thank you so much for this PR! A couple of really minor comments below, but overall excellent work.

anujagrawal380 · 2026-03-25T15:30:44Z

> @anujagrawal380 awesome contribution, thank you so much for this PR! A couple of really minor comments below, but overall excellent work.

Thanks, resolved both the comments!

jaypipes

rock on :) nice work on this @anujagrawal380!

carlydf

we need integration tests for this before merging to main / including it in a release

anujagrawal380 · 2026-03-26T07:01:34Z

> we need integration tests for this before merging to main / including it in a release

@carlydf Added the integration tests. PTAL

carlydf · 2026-04-22T16:59:07Z

Hi @anujagrawal380, could you fix the linters? Would love to include this in our next release.

…ion finalizer
- Add 5-minute deletionCleanupTimeout to prevent TWD stuck in Terminating state indefinitely if Temporal server is unavailable
- Return errors from version/deployment deletion to trigger requeue until versions actually clear (pollers disappear as pods terminate)
- Add update/patch verbs and finalizers RBAC marker for TemporalConnections
- Fix comment-spacing lint on new kubebuilder:rbac markers

anujagrawal380 · 2026-04-22T18:10:27Z

> Hi @anujagrawal380, could you fix the linters? Would love to include this in our next release.

@carlydf @jaypipes Added a few more minor improvements here: 9fd0c74. PTAL!

carlydf · 2026-04-23T04:07:06Z

+	// deletionCleanupTimeout is the maximum duration to retry Temporal server-side
+	// cleanup before giving up and allowing the K8s resource to be deleted.
+	// This prevents the TWD from being stuck in Terminating state indefinitely
+	// if the Temporal server is unavailable or a version has persistent active pollers.
+	deletionCleanupTimeout = 5 * time.Minute


I think the goal of this finalizer is to keep the Kubernetes (TemporalWorkerDeployment) view and the Temporal server view aligned. If the server-side object was created by creating the k8s-side object, the server-side object should also be deleted by deleting the k8s-side object.

This timeout breaks that expectation. If the server is temporarily unavailable and this finalizer gives up and deletes the k8s-side object, the server-side object would never be deleted.

I'm curious whether you ran into this while testing? The default TTL for "active pollers" in the server is 5 minutes, so if the TWD has active pollers when it enters the Terminating state, the controller needs to kill those Deployments and then wait 5 minutes before the "no active pollers" check passes. If all versions were Drained and the pods had been scaled down for a while, this delay wouldn't exist.

Because of that 5-minute poller TTL, a 5-minute deletionCleanupTimeout would frequently fire whenever a TWD is deleted before natural scale-down, which IMO would not be good (because of the leftover server-side object I described above).

If we decide to keep this, I would advocate for:

- A very long threshold, like 1h
- Only timing out on unavailable errors from the server, not precondition failed (which is the "active pollers" thing)

I will defer to @jaypipes opinion here though!
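The two suggestions in this comment could be combined along the following lines. This is only a sketch of the proposed policy, not code from the PR: `errServerUnavailable`, `errPreconditionFailed`, `unavailableGiveUpAfter`, and `shouldGiveUp` are hypothetical names, and how the real controller classifies Temporal errors is not shown in this thread.

```go
package main

import (
	"errors"
	"time"
)

// Hypothetical error classes; the real controller would need to map Temporal
// server errors onto something like these.
var (
	errServerUnavailable  = errors.New("temporal server unavailable")
	errPreconditionFailed = errors.New("precondition failed: version has active pollers")
)

// unavailableGiveUpAfter is the "very long threshold, like 1h" from the comment.
const unavailableGiveUpAfter = time.Hour

// shouldGiveUp reports whether the finalizer may be removed even though
// server-side cleanup failed. Only unavailability counts toward the timeout;
// a precondition failure (active pollers) always keeps retrying.
func shouldGiveUp(err error, elapsed time.Duration) bool {
	if err == nil {
		return false // cleanup succeeded, nothing to abandon
	}
	if !errors.Is(err, errServerUnavailable) {
		return false // e.g. active pollers: keep requeueing until they expire
	}
	return elapsed >= unavailableGiveUpAfter
}
```

Under this policy a TWD deleted while its pods are still polling would stay in Terminating until the poller TTL lapses, rather than leaving a server-side object behind.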

anujagrawal380 requested review from a team and jlegrone as code owners March 24, 2026 18:18

jaypipes reviewed Mar 25, 2026

Comment thread internal/controller/worker_controller.go Outdated

Comment thread internal/controller/worker_controller.go

jaypipes approved these changes Mar 25, 2026

carlydf requested changes Mar 25, 2026

carlydf approved these changes Apr 16, 2026

anujagrawal380 added 5 commits April 22, 2026 23:13

b1a1dda fix: deleting a TWD leaves stale versioning data on Temporal server, blocking unversioned workers
0027f6b fix: persist connection until twd deletes for cleanup
95b3661 fix: add else log for no ramping version
7050931 fix: use single temporal.io/delete-protection finalizer
8d91494 fix: integration tests

All five commits carry: Signed-off-by: Anuj Agrawal <anujagrawal380@gmail.com>

anujagrawal380 force-pushed the fix/twd-leaves-stale-versioning-data branch 2 times, most recently from 22607bf to 5a2e844 Compare April 22, 2026 18:06

anujagrawal380 force-pushed the fix/twd-leaves-stale-versioning-data branch from d6a305c to 9fd0c74 Compare April 22, 2026 18:07

carlydf requested changes Apr 23, 2026

anujagrawal380 wants to merge 6 commits into temporalio:main from anujagrawal380:fix/twd-leaves-stale-versioning-data


4 participants
